In this paper we consider techniques for improving the performance of codes for general sparse problems by extracting both local and global structure information from a sparse matrix instance. This information can be used to improve the performance of the primitives through the utilization of specialized methods for the component parts which result from the matrix decomposition. A calculus is defined...
The main purpose of a memory consistency model is to serve as an agreement between hardware system designers and software developers on the semantics of memory operations so as to ensure correct execution of user programs. However, the bulk of past work on memory consistency models has been pursued from the hardware viewpoint. In this viewpoint, a memory consistency model is used to specify certain...
This paper will provide a survey of current work on object oriented tools and techniques for metacomputing systems. More specifically, we consider the problem of designing a software component architecture that extends the current emerging desktop object composition models to the domain of high performance networks and massively parallel compute servers.
This paper presents an instruction cache prefetching mechanism capable of prefetching past branches in multiple-issue processors. Such processors at high clock rates often use small instruction caches which have significant miss rates. Prefetching from secondary cache can hide the instruction cache miss penalties but only if initiated sufficiently far ahead of the current program counter. Existing...
This session deals with fundamental issues arising in wireless networking and their implications for high performance business, consumer and military computing applications. We will describe the physical basis for wireless networking, mobile IP solutions and challenges as well as ongoing research efforts to make high performance and high confidence computing over wireless networks possible. ...
A variety of parallel computer architectures are being used today to cope with the computationally intensive tasks in the areas of image processing and computer vision. Most image processing algorithms can readily exploit SIMD (Single Instruction, Multiple Data Stream) machine architectures. The mapping of these algorithms to such machines is rather straightforward. The fine granularity parallelism...
Since the beginning of computer history, there has been strong demand for faster computation, together with growing demand for lower cost, including lower operating cost, and for ease of use. Responding to these demands involves three major technological points, which are not limited to the high performance computing area. The state-of-the-art technologies...
This paper examines the use of coarse-grained multithreading to lessen the negative impact of memory access latencies on the performance of uniprocessor on-line transaction processing systems. It considers the effect of switching threads on cache misses in a two-level cache system. It also examines several different thread-switch policies. The results suggest that multithreading with a small number...
The paper presents a massively parallel implementation method of Prolog on the multithreaded parallel machine, Datarol-II. First the Logicflow model is introduced which was developed for implementing Prolog on massively parallel computers. The Logicflow is a dataflow-like graph in which nodes are macro dataflow nodes and tokens represent macrothreads. The Datarol-II architecture efficiently supports...
The Thread Synchronization Unit (TSU) is a hardware mechanism that provides data-driven thread synchronization and data consistency for multi-threaded architectures built with control-flow (i.e. commodity) microprocessors. The TSU design is based on the Decoupled Data-Driven model of execution. This model decouples the synchronization from the computation portions of a program and allows them to execute...
A technique for reducing the length of the data dependence path is presented. This technique, named tunneling-load, uses the register specifier buffer to hide the load latency, and thus reduces the length of the data dependence path. True data dependences cannot be removed by techniques such as register renaming, and are the unavoidable obstacle limiting the instruction level parallelism...
A major challenge in telecommunication design is introducing flexibility while still meeting real-time performance goals. Keeping both flexibility and performance while minimizing cost leads to mixed hardware-software systems. In the absence of a generic partitioning algorithm, accurate cost and performance modeling becomes crucial when exploring architectural alternatives. This paper presents a case...
We summarize an implementation of a distributed shared-memory system on a workstation cluster. In this paper, we introduce fast serial links called Serial Transparent Asynchronous First-in First-out Link (STAFF-Link). By using these links we construct a parallel processing system based on the workstation cluster. In the workstation cluster, a distributed shared-memory mechanism is utilized for interprocess...
In this paper, a distributed genetic algorithm (DGA) for 3-connectivity communication network design is proposed and implemented on a transputer-based parallel machine, ParsyTec Gcel-164. The emphasis is on how parallelism can be used with the genetic algorithm. Performance of the (sequential) genetic algorithm (GA) is compared to the Dijkstra algorithm (DA) in terms of computation time and total link...
Recursive Diagonal Torus, or RDT, consisting of recursively structured tori, is an interconnection network for massively parallel computers. By recursively adding remote links in the diagonal directions of the torus network, the diameter can be reduced to within log2 N with a smaller number of links than that of a hypercube. For an interconnection network for massively parallel computers, a routing...
Many modern machine architectures feature parallel processing at both the fine-grain and coarse-grain level. In order to efficiently utilize these multiple levels, a parallelizing compiler must orchestrate the interactions of fine-grain and coarse-grain transformations. The goal of the PROMIS compiler project is to develop a multi-source, multi-target parallelizing compiler in which the front-end and...
Optimizing inter-processor (PE) communication is crucial for parallelizing compilers for message-passing parallel machines to achieve high performance. In this paper, we propose a technique to eliminate redundant inter-PE messages. This technique utilizes a data-flow analysis to find a definition point that corresponds to a use point where the definition and the use occur in different PEs. If...
This paper describes the design and implementation of an automatic vectorizing and parallelizing compiler named V-Pascal Version 3. The compiler is designed as a workbench on which various vectorizing and parallelizing techniques are evaluated. The compiler can now vectorize/parallelize multiply-nested loops as reduced single loops, vectorize while-loops and recursive calls,...
This paper presents a compiler algorithm that automatically detects the appropriate loop indices of a given nested loop and applies loop interchange and tiling in order to overlap communication with computation. It also describes a method of generating communication for the tiled loop on distributed memory machines. The algorithm presented here has been implemented in our High Performance Fortran (HPF)...
For effective use of parallelizing compilers, an interactive environment that allows users to direct how parallelization is done is needed. As a first step toward building such an environment, we have developed a program visualization system named NaraView. The system provides two powerful methods for 3D visualization of program structure and data dependence. 3D visualization of program structure illustrates...